Fix possible deadlock in AWS pubsub #804
Conversation
Force-pushed from 68c47b5 to 2594652.
var lastFetch atomic.Pointer[time.Time]
var epoch time.Time
lastFetch.Store(&epoch)
There was a race on this, so I've moved it into an atomic
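For reference, a minimal sketch of the pattern, assuming the same shape as the diff above; the reader and writer goroutines here are illustrative, not the PR's actual code:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var lastFetch atomic.Pointer[time.Time]

	// Seed with the zero time so Load never returns nil.
	var epoch time.Time
	lastFetch.Store(&epoch)

	// Writer: the fetch loop records when it last fetched.
	go func() {
		now := time.Now()
		lastFetch.Store(&now)
	}()

	// Reader: safe to call from any goroutine, no mutex needed.
	time.Sleep(10 * time.Millisecond)
	fmt.Println("last fetch at:", lastFetch.Load())
}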
case <-fetchCtx.Done():
	return
I think this was the deadlock: if all the workers panicked and quit, the fetch processor could have been left trying to write onto the workChan with nothing pulling items off it. I've solved this by adding this new select, so if the ctx is done we don't even try to push to the channel, and I've also added additional panic recovery wrappers at other points in the code.
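A hedged sketch of the shape of that fix; Work, workChan, and pushWork are stand-ins for the PR's actual types, not its code:

package main

import "context"

// Work is a placeholder for the PR's work item type.
type Work struct{ ID int }

// pushWork guards every channel send with a select on the fetch context,
// so the fetcher can never block forever on workChan if all the workers
// have panicked and exited.
func pushWork(fetchCtx context.Context, workChan chan<- Work, items []Work) {
	for _, item := range items {
		select {
		case <-fetchCtx.Done():
			// Shutdown (or total worker failure): stop pushing
			// instead of deadlocking on a channel nobody reads.
			return
		case workChan <- item:
		}
	}
}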
// We should only long poll for 20 seconds, so if this takes more than
// 30 seconds we should cancel the context and try again
//
// We do this in case the ReceiveMessage call gets stuck on the server
// and doesn't return
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
One theory I have is that the AWS library might be stalling and blocking under high load, so I've introduced a smaller timeout to try to force a context-cancelled error.
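A minimal sketch of that guard, assuming the aws-sdk-go-v2 SQS client; receiveOnce and queueURL are illustrative names:

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

// receiveOnce wraps a 20-second long poll in a 30-second timeout: if
// ReceiveMessage stalls server-side, the wrapper context expires and the
// call returns context.DeadlineExceeded, letting the fetch loop retry.
func receiveOnce(parent context.Context, client *sqs.Client, queueURL string) ([]types.Message, error) {
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()

	out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
		QueueUrl:            &queueURL,
		WaitTimeSeconds:     20, // server-side long poll
		MaxNumberOfMessages: 10,
	})
	if err != nil {
		return nil, err
	}
	return out.Messages, nil
}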
// Check if the context has been cancelled, and if so, return the error
if ctx.Err() != nil {
	return ctx.Err()
Why? This might hide other errors from the func if it fails for some other reason.
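To illustrate the concern: classifying the returned error keeps the original cause visible, where a pre-emptive ctx.Err() check would replace it with context.Canceled. doWork here is hypothetical:

package main

import (
	"context"
	"errors"
)

// process returns the operation's own error rather than ctx.Err(), so a
// real failure isn't masked; expected shutdown cancellations are filtered
// out with errors.Is.
func process(ctx context.Context, doWork func(context.Context) error) error {
	err := doWork(ctx)
	if errors.Is(err, context.Canceled) {
		return nil // expected during shutdown, not a real failure
	}
	return err
}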
// If there was an error processing the message, apply the backoff policy
_, delay := utils.GetDelay(retryPolicy.MaxRetries, retryPolicy.MinBackoff, retryPolicy.MaxBackoff, uint16(deliveryAttempt))
- _, visibilityChangeErr := t.sqsClient.ChangeMessageVisibility(t.ctx, &sqs.ChangeMessageVisibilityInput{
+ _, visibilityChangeErr := t.sqsClient.ChangeMessageVisibility(ctx, &sqs.ChangeMessageVisibilityInput{
Shouldn't we do this with a context not derived from the input, so we do this even if the input context is canceled?
No, I deliberately wanted both these API calls to be released when the fetch context is cancelled, as we immediately go into a loop to go again. I think them being based on t.ctx rather than the fetch context could have been an issue: we never cancel t.ctx, but the fetchCtx is cancelled when we want to exit the WorkConcurrently code.
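A sketch of the context relationship being described, with names following the discussion rather than the exact code:

package main

import "context"

// workConcurrently derives fetchCtx from the long-lived parent (t.ctx in
// the PR). cancelFetch fires when the function returns, releasing every
// in-flight SQS call made with fetchCtx, while the parent is never cancelled.
func workConcurrently(parent context.Context, call func(context.Context)) {
	fetchCtx, cancelFetch := context.WithCancel(parent)
	defer cancelFetch()

	// ChangeMessageVisibility / DeleteMessage would use fetchCtx here, so
	// they unblock as soon as we exit and loop around to fetch again.
	call(fetchCtx)
}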
} else {
	// If the message was processed successfully, delete it from the queue
-	_, err = t.sqsClient.DeleteMessage(t.ctx, &sqs.DeleteMessageInput{
+	_, err = t.sqsClient.DeleteMessage(ctx, &sqs.DeleteMessageInput{
Same here
fetchWithPanicHandling := func(ctx context.Context, maxToFetch int) (work []Work, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = errs.B().Msgf("panic: %v", r).Err()
Can we include the stack like we do elsewhere?
errs.B() will build in a stack, no?
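If it doesn't, a minimal standard-library fallback is to capture the stack explicitly at recover time; this sketch is illustrative, not the PR's code:

package main

import (
	"fmt"
	"runtime/debug"
)

// fetchSafely records the panicking goroutine's stack in the error itself.
// debug.Stack must be called inside the deferred function so that it still
// reflects the frames of the panic.
func fetchSafely(fetch func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v\n%s", r, debug.Stack())
		}
	}()
	return fetch()
}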